
IEEE Journal of Biomedical and Health Informatics

Institute of Electrical and Electronics Engineers (IEEE)

Preprints posted in the last 90 days, ranked by how well they match IEEE Journal of Biomedical and Health Informatics's content profile, based on 14 papers previously published here. The average preprint has a 0.10% match score for this journal, so anything above that is already an above-average fit.

1
Detecting Mental Disorders in Social Media Using a Transformer-Based Ensemble of Binary Classifiers

Ovcharuk, O. V.; Mazurets, O.; Molchanova, M. V.; Kirpich, A.; Skums, P.; Sobko, O. V.; Barmak, O.; Krak, I.; Yakovlev, S.

2025-12-18 health informatics 10.64898/2025.12.16.25342390
Top 0.1%
53× avg

This study introduces a novel transformer-based ensemble framework for the multi-label detection of mental health disorders from social media posts. Unlike traditional multi-class approaches that often struggle with comorbidity, the proposed method employs a binary relevance strategy using fine-tuned DistilBERT models to identify co-occurring conditions, including depression, anxiety, and narcissistic personality disorder. To address class imbalance and optimize decision boundaries, the framework integrates a composite loss function (focal, dice, and log loss) and uses Youden's J statistic for threshold calibration. Validation on textual datasets demonstrates the efficacy of this approach, with an overall F1-score of 0.930 and AUC values exceeding 0.89. Comparative analysis suggests that decomposing complex diagnostic tasks into independent binary problems significantly reduces inter-class confusion relative to standard multi-class baselines. Furthermore, a qualitative error analysis highlights specific linguistic challenges, such as contextual polarity shifting, metaphorical ambiguity, and colloquial usage, that impact model specificity. The findings demonstrate the potential of the proposed framework as a robust screening tool for online mental health monitoring, while underscoring the necessity of human oversight to mitigate linguistic misinterpretations.
Author summary: Mental health disorders such as depression, anxiety, and narcissistic personality disorder represent a major global health challenge. This work proposes a method that employs transformer-based deep learning models to analyze social media posts for mental health assessment. A significant hurdle in automated diagnosis is that these conditions often occur together (comorbidity), whereas many existing Artificial Intelligence (AI) systems are designed to detect only a single disorder at a time. This study proposes a solution using a "multi-label" deep learning framework. Rather than relying on a single multi-class classifier, the approach uses an ensemble of specialized binary models, each trained to detect indicators of a specific disorder. This design reduces classification confusion between clinically similar conditions, such as depression and anxiety. Evaluated on publicly available datasets, the method achieved an F1-score of 0.930, outperforming existing approaches, and separated clinically similar disorders better than traditional methods. Crucially, a detailed investigation beyond standard statistical metrics examined the model's specific mistakes. It found that, while the presented AI model is highly sensitive, it can be confused by specifics of language such as metaphors (e.g., "feeling like a pressure cooker"), negations (e.g., "I am not worried"), and colloquial use of clinical terms. These results highlight that AI is a powerful tool for early screening and continuous monitoring on social media, but that it still requires careful calibration and human oversight to distinguish between genuine symptoms and everyday emotional expression. The findings demonstrate that analyzing social media texts with advanced machine learning techniques can serve as a powerful complementary tool to clinical diagnostics. While not intended to replace professional evaluation, the proposed approach can help identify potential risks, promote earlier detection of mental health disorders, support preventive interventions, and ultimately improve access to care.
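The threshold calibration mentioned in this abstract can be illustrated with a small sketch. It shows the textbook form of Youden's J statistic (J = sensitivity + specificity - 1) applied to the scores of a single binary classifier; it is not the authors' implementation. The function name `youden_threshold` and the example labels and scores are hypothetical.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    best_threshold, best_j = None, -1.0
    # Each distinct score is a candidate cut-off: predict positive when score >= cut-off.
    for threshold in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
        sensitivity = tp / positives
        specificity = 1 - fp / negatives
        j = sensitivity + specificity - 1
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Hypothetical labels and scores for one disorder-specific binary model.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.3]
threshold, j = youden_threshold(labels, scores)
```

In a binary-relevance ensemble, a routine like this would presumably be run once per disorder-specific model on validation data, giving each classifier its own cut-off instead of a shared 0.5.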

2
Small-sized Reasoning Language Models for Linguistic Screening of Alzheimer's Disease

Addepalli, V. r.; Abdalnabi, N.; Kummerfeld, E.; Hembroff, G.; Kiselica, A. M.; Rao, P.; Lee, K.

2026-01-01 health informatics 10.64898/2025.12.24.25342972
Top 0.1%
51× avg

Alzheimer's disease (AD) is increasing in prevalence, and early detection is essential for timely care. Clinical services face growing demand, leading to delays in diagnostic appointments and increasing the risk of disease progression before evaluation. This work examines artificial intelligence (AI) methods for assessing cognitive status from linguistic features. The proposed architecture uses small language models (SLMs) to analyze speech patterns, and its compact design allows deployment on mobile devices. Recent reasoning-focused models, including DeepSeek-R1 and Llama, were evaluated for dementia classification. Multiple fine-tuning strategies were compared, and the best model achieved 91% accuracy and an F1 score. The findings show that AI systems built on SLMs can achieve performance comparable to large language models, indicating their potential as efficient tools that may support health care providers through accessible pre-clinical screening for AD.

3
A deterministic safety pipeline for therapeutic AI in elderly assisted living

Sheriff, A.

2026-02-18 health informatics 10.64898/2026.02.17.26346507
Top 0.2%
47× avg

Over 54 million Americans are aged 65+, with depression affecting 25-49% and anxiety exceeding 30% of assisted living residents. AI systems employing agentic orchestration exhibit 0.5-2% failure rates, which is unacceptable where a single missed crisis can be fatal. We designed and bench-evaluated Lilo Engine, a 5-layer deterministic therapeutic pipeline replacing a prior multi-agent orchestrator. Safety is enforced through structural invariants: a Guardian layer with 4-gate OR crisis detection runs unconditionally on every input; a Reflector layer validates every output. Evaluated across 3,720 test scenarios, the system achieved 100% crisis recall (500/500 comprehensive scenarios), <5% false positive rate, and 28.7 ms detection latency, well within crisis response benchmarks. Intent classification reached 96.4% accuracy; generation quality 98.4%. The architecture reduced execution paths from 7+ to exactly 2, producing deterministic, HIPAA-auditable traces. Clinical validation with elderly populations is the essential next step.

4
MedOS: AI-XR-Cobot World Model for Clinical Perception and Action

Wu, Y. C.; Yin, M.; Shi, B.; Zhang, Z.; Yin, D.; Wang, X.; Wang, Y.; Fan, J.; Jin, R.; Wang, H.; Ying, K.; Pang, K.; Rojansky, R.; Curtis, C.; Bao, Z.; Wang, M.; Cong, L.

2026-02-23 health informatics 10.64898/2026.02.18.26345936
Top 0.2%
47× avg

Medicine historically separates abstract clinical reasoning from physical intervention. We bridge this divide with MedOS, a general-purpose embodied world model. Mimicking human cognition via a dual-system architecture, MedOS demonstrates superior reasoning on biomedical benchmarks and autonomously executes complex clinical research. To extend this intelligence physically, the system simulates medical procedures as a physics-aware model to foresee adverse events. Generating and validating on the MedSuperVision benchmark, MedOS exhibits spatial intelligence for reasoning and action. Crucially, we demonstrate that this platform democratizes clinical expertise and narrows the performance gap between junior and senior physicians. MedOS transforms clinical intervention towards a collaborative discipline where human intuition and machine intelligence co-evolve.

5
Leveraging the wearable 1-lead ECG signal: From cardiac rhythm to cardiac function assessment

van der Valk, V. O.; Atsma, D.; Scherptong, R.; Staring, M.

2026-02-07 cardiovascular medicine 10.64898/2026.02.02.26345091
Top 0.2%
46× avg

The electrocardiogram (ECG) is a critical tool in the diagnosis and monitoring of cardiovascular disease. Although traditional 12-lead ECGs offer comprehensive insights into the electrical activity of the heart, they typically require clinical settings and expert interpretation, which limits their accessibility. In contrast, smartwatch 1-lead ECGs can be recorded at home, allowing more frequent and rapid monitoring. This opens opportunities not only for early detection but also for enhancing patient autonomy. This study investigates whether 1-lead ECGs can provide information beyond heart rhythm, specifically whether they can be used to assess left ventricular function (LVF) using explainable deep learning models. Our findings show that LVF can be accurately predicted from 1-lead ECGs (AUC = 0.883), nearly matching the performance of 12-lead ECGs (AUC = 0.897). These results suggest that 1-lead ECGs, when combined with interpretable AI, could support broader clinical applications and empower patients, particularly in resource-limited or remote settings.

6
Quantifying the severity of patient safety events via statistical natural language processing

Bhadra, S.; Fong, A.; Sengupta, S.

2025-12-27 health informatics 10.64898/2025.12.22.25342876
Top 0.2%
46× avg

Medical errors are one of the leading causes of death in the United States. Several public databases have been built to record patient safety events across healthcare systems to better understand and improve safety hazards. These reports typically include both structured fields (e.g., event type, device, manufacturer) and unstructured data elements (free text narrative of what happened). The structured fields are usually restricted to a limited number of categories, whereas the unstructured fields allow the reporter to freely describe the event details. Thus, analyzing the unstructured text, rather than the structured fields, can reveal rich insights that can help improve patient safety. However, manual analysis of these databases is impractical due to their large size and the inherent subjectivity of manual interpretation. Therefore, we need new statistical algorithms to automate this process. In this paper, we develop a novel statistical technique to predict the severity level of a patient safety event based on its free text description. Using NLP techniques, we first express the raw event descriptions as numeric feature vectors and then use statistical techniques to model the severity of the events based on the feature vectors. We consider and compare three statistical approaches: multiclass (one-shot), ordinal, and hierarchical (two-step) models. To illustrate the proposed method, we analyzed a large text corpus of more than 7.7 million patient safety reports from the FDA's MAUDE (Manufacturer and User Facility Device Experience) database. The proposed techniques correctly predicted the reported outcome of the events with above 94% accuracy. Furthermore, our techniques helped identify critical terms/phrases and provide a continuous-scale harm score, which can be more useful than a discrete severity level. Inspecting the misclassified reports, we discovered some likely occurrences of mislabeled reports which are correctly classified by our proposed approach.

7
Leveraging Generative Artificial Intelligence for Enhanced Data Augmentation in Emotion Intensity Classification: A Comprehensive Framework for Cross-Dataset Transfer Learning

Wieczorek, J.; Jiang, X.; Palade, V.; Trela, J.

2026-03-03 health informatics 10.64898/2026.02.23.26346928
Top 0.3%
45× avg

Data scarcity and stylistic heterogeneity pose major challenges for emotion intensity classification. This paper presents a cross-dataset augmentation framework that leverages prompt-conditioned generative models alongside deterministic and heuristic transformations to synthesize target-style examples for improved transfer learning. We introduce a unified taxonomy of augmentation strategies, namely Heuristic Lexical Perturbation (HLA), Prompt-Conditioned Generative Augmentation (CGA), Sequential Hybrid Pipeline (SHA), Rule-Guided Style Adaptation (DSGA), and Enhanced Hybrid Augmentation (EHA), and detail an interpretability-oriented prompt engineering approach that conditions LLMs on authentic target exemplars and stylistic features extracted from the target dataset. Augmented datasets were evaluated using multi-dimensional quality metrics (transformation quality, stylistic consistency, BLEU/CHRF, Self-BLEU, uniqueness) and downstream classification via a two-phase BERT-LSTM training with rigorous statistical testing. During source dataset pretraining and subsequent target dataset fine-tuning, CGA achieved the highest single-method gains in F1 and accuracy (F1 = 0.8816; accuracy = 0.8819, 95% CI recalculated). HLA and SHA exhibited improved cross-domain stability, suggesting stronger domain-generalizable features. We observe systematic trade-offs between fluency, lexical diversity, and emotion fidelity: high surface similarity often correlates with classifier performance but does not fully capture affective authenticity. We discuss methodological pitfalls, propose best practices for emotion-aware augmentation, and provide reproducible artifacts (prompts, example transformations, evaluation scripts) to facilitate further research in affective NLP.

8
Comparison of LSTM and Transformer Models for Activities of Daily Living Recognition using In-Home Ambient Sensors

Abdalnabi, N.; Addepalli, V. r.; Sarker, S.; Kiselica, A.; Kummerfeld, E.; Rao, P.; Lee, K.

2026-01-06 health informatics 10.64898/2026.01.06.25342431
Top 0.3%
31× avg

This study evaluated Long Short-Term Memory (LSTM) and Transformer artificial intelligence (AI) models for recognizing Activities of Daily Living (ADLs) using data collected from a low-cost, non-invasive ambient in-home sensor system. Motion, temperature, luminance, and door-contact sensors were deployed in a two-participant home for 22 days, with ground truth established through volunteer logs and expert validation. Missing data were handled using Akima and linear interpolation. Models were trained using a 16/3-day train-validation split with the last three days reserved for testing to avoid temporal leakage. Performance assessments included participant-specific modeling, sequence-length variation, and hyperparameter tuning, using micro-accuracy and AUC-ROC as evaluation metrics. The Transformer consistently outperformed the LSTM, particularly for Participant 2 (AUC 0.91 vs. 0.87), and demonstrated superior adaptability to irregular time-series data. Findings underscore the feasibility of combining ambient sensing with AI for accurate ADL recognition and its potential to enable early detection of cognitive decline, reduce hospitalization risk, and alleviate caregiver burden.
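A chronological split of the kind described in this abstract (16 training days, 3 validation days, and the last 3 of the 22 days held out for testing) can be sketched as follows. This is a minimal illustration, assuming days are identified by integers 1 to 22; the function name `chronological_split` is hypothetical and the authors' actual data handling is not described in the abstract.

```python
def chronological_split(days, n_train, n_val, n_test):
    """Split day identifiers in time order so later days never leak into training."""
    ordered = sorted(days)
    if n_train + n_val + n_test != len(ordered):
        raise ValueError("split sizes must add up to the number of days")
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


# Assumed day numbering for a 22-day deployment.
days = list(range(1, 23))
train_days, val_days, test_days = chronological_split(days, 16, 3, 3)
```

Splitting by whole days in time order, instead of shuffling individual sensor windows, is what keeps readings from the test period out of training.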

9
A Clinical Theory-Driven Deep Learning Model for Interpretable Autism Severity Prediction

Hu, X.

2026-01-26 health informatics 10.64898/2026.01.25.26344792
Top 0.4%
30× avg

Autism spectrum disorder (ASD) affects a substantial proportion of children worldwide, yet clinical assessment of symptom severity remains resource-intensive and unevenly accessible. Artificial intelligence (AI) has transformative potential to support scalable and timely severity assessment from behavioral data, but existing approaches largely treat autism as a monolithic prediction target and rely on opaque models that are difficult for clinicians to interpret or trust. Moreover, prior multimodal methods typically integrate heterogeneous behavioral signals using ad hoc fusion strategies that are weakly grounded in clinical theory. We propose a clinical theory-driven deep learning model for interpretable autism severity assessment that explicitly operationalizes established clinical constructs into model design. Drawing on autism research, we represent social construct and motor construct as distinct latent components. These components are integrated through a structured cross-modal attention mechanism guided by a learnable alignment mask that encodes soft spatial correspondence priors between visual and kinematic representations. Theory-specific blocks then aggregate aligned tokens into construct embeddings, which are fused via instance-specific theory weights, yielding transparent symptom profiles aligned with clinical reasoning. Comprehensive experiments demonstrate the state-of-the-art performance of our model over existing baselines. Ablation studies validate that performance gains arise from theory-driven design choices. Analysis of the learned theory weights reveals systematic relationships between symptom profiles and severity, providing empirical support for the multidimensional structure of autism. This work demonstrates how clinical theory can be instantiated as empirically testable architectural designs in deep learning models, advancing both predictive utility and interpretability in healthcare AI systems.

10
Updated U.S. census benchmark sleep dataset v1.1

Jones, A. M.; Sheth, B. R.

2025-12-29 health informatics 10.64898/2025.12.27.25343087
Top 0.5%
29× avg

We previously documented and released a benchmark dataset for machine learning research on sleep stage classification [1]. Subsequently, it was pointed out in a preprint [2] that some recordings in the National Sleep Research Resource [3] include only binary wake-sleep annotations, instead of full sleep stage scoring using the Rechtschaffen and Kales (R&K) [4] or American Academy of Sleep Medicine (AASM) [5] standards. Because wake-sleep labels are an ontological mismatch and not just label noise, they do not belong in a dataset designed for full sleep stage classification. Therefore, we have updated our benchmark dataset (henceforth known as benchmark v1.0) to replace 16 recordings with suitable recordings from age- and sex-matched subjects, while all other dataset selection criteria and distributions have been preserved. Additionally, the total number of recordings and the composition of the training, validation, and testing sets remain unchanged. While this update is a minor revision, we want to distinguish its use from v1.0, and therefore have titled this update as benchmark v1.1. The file listings are provided on the GitHub repository (https://github.com/adammj/ecg-sleep-staging/).

11
A Wearable Multi-modal Sensor Array for Continuous Cuffless Blood Pressure Estimation

Rattray, J.; Nnadi, B.; Rapuri, S.; Harris, C. W.; Tenore, F.; Gamaldo, C.; Stevens, R. D.; Etienne-Cummings, R.

2026-01-26 cardiovascular medicine 10.64898/2026.01.25.26344788
Top 0.5%
29× avg

Blood pressure (BP) measurement is crucial for medical care, yet existing BP methods are either invasive, tethered, or suffer from low temporal resolution. Non-invasive continuous BP estimation thus remains a significant challenge. To address these challenges, this work presents a novel, non-invasive, multi-modal sensor designed for continuous blood pressure estimation using multiple biosignal modalities as feature inputs. From these input data, we extract cardiovascular timing intervals (e.g., pulse arrival time), which serve as key features for BP regression models, enabling continuous, non-invasive BP monitoring. We validate our algorithm with 16 healthy subjects using standard blood pressure cuff readings as ground truth. Our wearable, non-invasive multimodal and multinodal sensor array for integrated computation (MOSAIC) demonstrated promising performance and was able to predict systolic and diastolic BP across all study subjects with a MAE of 5.31 ± 7.32 mmHg and 4.27 ± 2.35 mmHg, respectively.

12
Thyroid Cancer Risk Prediction from Multimodal Datasets Using Large Language Model

Ray, P.

2026-03-06 health informatics 10.64898/2026.03.05.26347766
Top 0.6%
28× avg

Thyroid carcinoma is one of the most prevalent endocrine malignancies worldwide, and accurate preoperative differentiation between benign and malignant thyroid nodules remains clinically challenging. Diagnostic methods that medical practitioners use at present depend on their personal judgment to evaluate both imaging results and separate clinical tests, which creates inconsistency that leads to incorrect medical evaluations. The combination of radiological imaging with clinical information systems enables healthcare providers to enhance their capacity to make reliable predictions about patient outcomes while improving their decision-making abilities. The study introduces a deep learning framework that utilizes multiple data sources by combining magnetic resonance imaging (MRI) data with clinical text to predict thyroid cancer. The system uses a Vision Transformer (ViT) to obtain advanced MRI scan features, while a domain-adapted language model processes clinical documents that contain patient medical history and symptoms and laboratory results. The cross-modal attention system enables the system to merge imaging data with textual information from different sources, which helps to identify how the two types of data are interconnected. The system uses a classification layer to classify the fused features, which allows it to determine the probability of cancerous tumors. The experimental results show that the proposed multimodal system achieves better results than the unimodal base systems because it has higher accuracy, sensitivity, specificity, and AUC values, which help medical personnel to make better preoperative decisions.

13
Selectively Augmented Decision Tree for Explainable Dementia Detection

Kamalov, F.; Thabtah, F.; Peebles, D.; Ibrahim, A.

2026-02-04 health informatics 10.64898/2026.02.03.26345441
Top 0.6%
26× avg

Timely and accurate diagnosis of dementia remains a critical yet challenging task. Although machine learning (ML) techniques have shown considerable promise in dementia detection, their inherent complexity often results in opaque, "black-box" models that limit clinical acceptance and usability. In this paper, we propose a Selectively Augmented Decision Tree (SADT), an interpretable AI model specifically designed for dementia detection. SADT incorporates a structured three-phase pipeline consisting of feature selection, data balancing, and construction of a transparent decision tree classifier. We apply SADT to the OASIS dataset and evaluate it empirically, showing that SADT outperforms traditional ML benchmarks, validating its effectiveness. In addition to its superior performance, SADT also mirrors aspects of human decision-making in its sequential, rule-based prioritization of key features. This approach aligns with cognitive models of cue use and heuristic reasoning, making it not only clinically transparent but also psychologically aligned with how diagnostic decisions are often made in practice. SADT's strong predictive performance and interpretability grounded in human reasoning facilitate explanation and human scrutiny, and have the potential to improve both clinical decision-making and trust in AI-assisted diagnosis.

14
Uncertainty-aware personalized estimation of Parkinson's disease severity from longitudinal speech

Shahriar, K. A.

2026-02-05 health informatics 10.64898/2026.02.04.26345576
Top 0.6%
26× avg

Parkinson's disease is a progressive neurological disorder characterized by motor impairments whose severity is commonly assessed using the Unified Parkinson's Disease Rating Scale (UPDRS). Although clinically established, UPDRS assessment requires in-person evaluation by trained specialists and is inherently subjective, limiting its suitability for frequent monitoring. Speech production is affected early in Parkinson's disease and provides a non-invasive modality for remote symptom assessment. In this study, an uncertainty-aware personalized framework is proposed for estimating Parkinson's disease severity from speech signals. The approach integrates temporal modeling of longitudinal speech recordings with patient-specific representations and a probabilistic latent disease state. Continuous motor UPDRS scores are estimated jointly with ordinal disease severity stages, enabling both fine-grained regression and clinically interpretable stratification. Predictive uncertainty is explicitly quantified, yielding confidence-aware severity estimates suitable for telemonitoring applications. The method is evaluated on a longitudinal speech dataset using a strict patient-wise split, ensuring that all test subjects are unseen during training. On the held-out test set, the proposed model achieves high predictive accuracy (mean absolute error 0.56 UPDRS points, root mean squared error 0.74, and coefficient of determination R² = 0.99) for motor UPDRS estimation. Ordinal severity classification attains an accuracy of 0.92 across mild, moderate, and severe disease stages. Comparative experiments against classical machine learning methods and global temporal baselines demonstrate consistent performance improvements. These results indicate that personalized, uncertainty-aware modeling of speech signals can support accurate and clinically meaningful remote monitoring of Parkinson's disease severity.
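For readers less familiar with the three regression metrics quoted in this abstract, the sketch below computes mean absolute error, root mean squared error, and the coefficient of determination from their standard definitions. The function name `regression_metrics` and the example scores are hypothetical and are not taken from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical motor UPDRS values, for illustration only.
true_scores = [10.0, 20.0, 30.0, 40.0]
pred_scores = [11.0, 19.0, 31.0, 39.0]
mae, rmse, r2 = regression_metrics(true_scores, pred_scores)
```

Because R² is relative to the spread of the true scores, a wide range of severity values in the test set can yield a high R² even when absolute errors are not negligible, so MAE and RMSE are worth reading alongside it.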

15
Automated Burn Detection from Images Using Deep Learning Models: The Role of AI in the Triage of Burn Injuries

Durgude, A.; Soni, N.; Raghuwanshi, K. C.; Awasthi, S.; Uniyal, K.; Yadav, S.; Kakani, A.; Kesharwani, P.; Mago, V.; Vathulaya, M.; Rao, N.; Chattopadhyay, D.; Kapoor, A.; Bhimsaria, D.

2025-12-31 health informatics 10.64898/2025.12.24.25337638
Top 0.6%
26× avg

Burn injuries are a significant concern in developing countries due to limited infrastructure, and treating them remains a major challenge. The manual assessment of burn severity is subjective and depends, to a large extent, on individual expertise. Artificial intelligence can automate this task with greater accuracy and improved predictions, which can assist healthcare professionals in making more informed decisions while triaging burn injuries. This study established a model pipeline for detecting burn injuries in images using multiple deep learning models, including U-Net, DenseNet, ResNet, VGG, EfficientNet, and transfer learning with the Segment Anything Model 2 (SAM 2). The problem statement was divided into two stages: 1) removing the background and 2) burn skin segmentation. ResNet50, used as an encoder with a U-Net decoder, performs better for the background removal task, achieving an accuracy of 0.9757 and an intersection over union (Jaccard index) of 0.9480. DenseNet169, used as an encoder with a U-Net decoder, performs well in burn skin segmentation, achieving an accuracy of 0.9662 and an intersection over union of 0.8504. The dataset collected during the project is available for download to facilitate further research and advancements (Link to dataset: https://geninfo.iitr.ac.in/projects). Total body surface area (TBSA) was estimated from predicted burn masks using scale-based calibration.
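The intersection over union (Jaccard index) reported for both segmentation stages can be computed as in the sketch below. It assumes binary masks stored as nested lists of 0/1 values; the function name `iou` and the tiny example masks are hypothetical, and this is not the authors' evaluation code.

```python
def iou(pred_mask, true_mask):
    """Intersection over union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so define their IoU as 1.0.
    return intersection / union if union else 1.0


# Hypothetical 2x3 masks: 1 marks a burn pixel, 0 marks background.
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
score = iou(predicted, reference)
```

Unlike pixel accuracy, IoU ignores pixels that both masks label as background, which may explain why the burn segmentation stage shows a larger gap between accuracy (0.9662) and IoU (0.8504) than the background removal stage.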

16
Uspet: Unsupervised Segmentation of PET Images

Jaakkola, M.; Karpijoki, H.; Saari, T.; Rainio, O.; Li, A.; Knuuti, J.; Virtanen, K.; Klen, R.

2025-12-15 health informatics 10.64898/2025.12.15.25342254
Top 0.7%
25× avg

Background: Segmentation is a routine, yet time-consuming and subjective step in the analysis of positron emission tomography (PET) images. Automatic methods to do it have been suggested, but recent method development has focused on supervised approaches. The previously published unsupervised segmentation methods for PET images are outdated for the arising dynamic human total-body PET images now enabled by the evolving scanner technology. Methods: In this study, we introduce an unsupervised general purpose automatic segmentation method for modern PET images consisting of tens of millions of voxels. We provide its implementation in an easy-to-use format and demonstrate its performance on two datasets of real human total-body images scanned using different radiotracers. Results and conclusions: Our results show that the suggested method can identify functionally distinct areas within the anatomical organs. Combined with anatomical segments obtained from other imaging modalities, this enables great potential to improve clinically meaningful segmentation and reduce time-consuming manual work.

17
ED-Triage-Agent: A Framework for Human-AI Collaborative Emergency Triage

Sharma, K.; Sivadas, H.; Reddy, S.

2026-02-18 health informatics 10.64898/2026.02.17.26346501
Top 0.7%
25× avg

Emergency Department triage is a critical decision-making process in which clinicians must rapidly assess patient acuity under high cognitive load and time pressure. We present ED-Triage-Agent (ETA), a multi-agent AI framework designed to augment clinical decision-making in Emergency Severity Index (ESI) classification through human-AI collaboration. The system operates in two phases: (1) autonomous patient intake via a conversational agent that collects structured symptom histories and (2) collaborative acuity assessment in which specialized agents prioritize patients for vital sign collection and generate ESI classifications with explicit clinical reasoning. Unlike monolithic AI prediction systems, ETA mirrors clinical workflow by supporting decisions at each triage stage while preserving clinician autonomy. We describe the system architecture, agent design principles, and a preliminary evaluation methodology using the ESI Implementation Handbook case studies (60 standardized cases). This work proposes a model for deploying multi-agent AI systems in time-critical clinical environments where explainability and human oversight are essential. Code and the evaluation framework are available at https://github.com/Karthick47v2/ED-Triage-Agent.

18
Severity of Depression and Anxiety Symptoms Manifest in Physiological and Behavioral Metrics Collected from a Consumer-Grade Wearable Ring

Sameh, A.; Azadifar, S.; Nauha, L.; Karmeniemi, M.; Niemela, M.; Farrahi, V.

2026-02-09 health informatics 10.64898/2026.02.06.26345566
Top 0.8%
24× avg

Wearable devices can capture changes in human behavior related to mental health, including depression and anxiety. Here, we examined whether and how digital metrics from a consumer-grade wearable smart ring (Oura Ring) differed by severity of depression and anxiety symptoms using data from a large-scale population-based sample of young adults (n=1,290, age range: 33-35). Participants wore the ring for two weeks, assessing sleep architecture, nocturnal heart rate (HR), heart rate variability (HRV), and movement intensity. Mental health symptoms were assessed using the Generalized Anxiety Disorder 7-item and Hopkins Symptom Checklist-25 scales. On average, participants with higher depression and/or anxiety symptoms had lower levels of rapid eye movement, higher levels of deep and light sleep, elevated nocturnal HR, reduced HRV, and lower daytime movement compared to non-symptom individuals. Findings suggest that symptoms of depression and anxiety may manifest in physiological and behavioral metrics collected by consumer-grade wearable devices.

19
Drug Safety Agents Using Graphs and Ontologies

Mathialagan, C. S.; Nip, A.; Bhat, A.

2026-02-05 health informatics 10.64898/2026.02.04.26345582
Top 0.8%
24× avg

In pharmacovigilance, analyzing drug safety cases is often time-consuming due to the abundance of laboratory data, complex medical histories, and intricate temporal relationships. Agentic AI systems can significantly reduce case processing time by assisting medical reviewers in surfacing clinically relevant evidence. However, previous studies have highlighted that large language models alone lack causal reasoning and evidence-based interpretability. To address these limitations, we present a knowledge-grounded safety case analysis framework that integrates disproportionality analysis to generate and prioritize potential adverse event hypotheses. Crucially, we introduce a novel hallucination-risk-aware execution planner that dynamically routes queries to the safest reasoning pathway, prioritizing deterministic graph retrieval over generative methods for clinically sensitive signals. The system demonstrates how structured medical knowledge and statistical evidence can be combined to support a reliable, explainable case assessment and can be readily extended with causal inference modules for deeper clinical reasoning.

20
Benchmarking Large Language Models for Intensive Care Unit Clinical Decision Support: A Dual Safety Evaluation of 26 Models on Consumer Hardware

Shlyakhta, T.

2026-02-10 health informatics 10.64898/2026.02.08.26345854
Top 0.8%
24× avg

Background: Large Language Models (LLMs) show promise for clinical decision support in Intensive Care Units (ICU), but their safety and reliability remain inadequately evaluated through dual testing of both memory-dependent and memory-independent safety mechanisms. Objective: To comprehensively evaluate LLMs using two independent safety tests: context-dependent contraindication memory (penicillin allergy recall) and context-independent authority resistance (Extended Milgram Test), revealing whether these represent unified or dissociated safety mechanisms. Methods: Twenty-three LLMs underwent automated testing via 24-hour ICU simulation on consumer hardware (NVIDIA RTX 3060 12GB). A subset of 26 models completed an Extended Milgram Test with five escalating harmful command scenarios. Scoring assessed safety compliance, Milgram resistance, conflict detection, and performance. Results: Critical findings revealed dissociation between abstract ethics and clinical memory. While 65% of models achieved perfect Milgram resistance (100%), only 8.7% (n=2) correctly refused penicillin with allergy mention. Eight models demonstrated 100% Milgram resistance yet failed allergy recall (r = -0.39, p = 0.23). Only Granite 3.1 8B achieved perfect performance on both tests. Conclusions: Abstract ethical reasoning (refusing harmful orders in principle) is independent from concrete clinical memory (tracking patient-specific risks). Safe medical AI requires both capabilities, which are rarely present together. Dual safety testing should become mandatory for medical AI certification.
Highlights:
- Only 8.7% of tested LLMs passed critical safety tests for medication prescribing
- First study demonstrating dissociation between abstract ethics and clinical memory (r = -0.39)
- Eight models refused all harmful orders but forgot documented allergies
- Granite 3.1 8B only model achieving perfect performance on both safety tests
- Dual safety testing framework proposed for medical AI certification